Skip to content

Conversation

@kimi-p
Copy link
Contributor

@kimi-p kimi-p commented Nov 10, 2022

What does this PR do?

  • Fetch Step Functions tags from AWS resourcegroupstaggingapi api and attach these tags to logs.
  • An integration test for step function is added.
  • The same S3 caching strategy is used as Lambda tags.
  • CloudFormation template is updated and DD_FETCH_STEP_FUNCTIONS_TAGS flag is added with default to true.
  • DEPLOY_TO_SERVERLESS_SANDBOX env var flag is add for installation_test.sh to deploy to Serverless sandbox account.
  • Two S3 buckets are created in Serverless sandbox account, datadog-cloudformation-template-serverless-sandbox and dd-lambda-signing-bucket-serverless-sandbox.
  • This issue about CloudWatch logs not using S3 cache at all is not address in this PR. I will discuss with the team and have another PR to fix it. Hopefully will release this PR and the fix to CW tags caching together after that work is done.

Motivation

logs-to-traces project build Step Functions traces from aws logs. To get the env tag, we'd like to fetch Step Function tags on the forwarder and send these tags with logs to the logs intake. logs-to-traces reducer will then pick up these tags and label traces with the correct env.
https://datadoghq.atlassian.net/browse/SLS-2718

Testing Guidelines

Tested in Serverless sandbox account by running ./installation_test.sh with stack deletion line commented out.

  • logs are tagged with env:staging123 and kimi_test:kimi-test (tags are labeled on the testing state machine)

image

image

  • JSON file used for caching also works as expected

image

  • Verified that CloudWatch logs are also attached

image

Additional Notes

resourcegroupstaggingapi API doc

Testing Forwarder Logs

{
    "PaginationToken": "",
    "ResourceTagMappingList": [
        {
            "ResourceARN": "arn:aws:states:sa-east-1:425362996713:stateMachine:logs-to-traces-complicated-state-machine",
            "Tags": [
                {
                    "Key": "KIMI_TEST",
                    "Value": "kimi-test"
                },
                {
                    "Key": "ENV",
                    "Value": "staging123"
                }
            ]
        }
    ],
    "ResponseMetadata": {
        "RequestId": "333d6c49-0508-44d5-9a43-8940e28b0554",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "333d6c49-0508-44d5-9a43-8940e28b0554",
            "content-type": "application/x-amz-json-1.1",
            "content-length": "243",
            "date": "Tue, 15 Nov 2022 16:24:54 GMT"
        },
        "RetryAttempts": 0
    }
}

Types of changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)
  • This PR passes the unit tests
  • This PR passes the installation tests (ask a Datadog member to run the tests)

@github-actions github-actions bot added the aws label Nov 10, 2022
@kimi-p kimi-p marked this pull request as ready for review November 16, 2022 16:09
@kimi-p kimi-p changed the title Add Step Functions Tags to logs Forward Step Functions Tags with logs to the backend Nov 16, 2022
@kimi-p
Copy link
Contributor Author

kimi-p commented Nov 16, 2022

Verified again after the scripts are updated.
both env and kimi_test tags are showing up correctly in this trace

@kimi-p kimi-p assigned agocs and DarcyRaynerDD and unassigned agocs and DarcyRaynerDD Nov 17, 2022
Copy link
Contributor

@DarcyRaynerDD DarcyRaynerDD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me overall, left a few questions.

#######################


class StepFunctionsTagsCache(LambdaTagsCache):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have three caches, it might be a good idea to split these Cache classes out into their own files.

get_resources_paginator = resource_tagging_client.get_paginator("get_resources")

try:
for page in get_resources_paginator.paginate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand, we are following the lambda approach of prefetching the tags from each StepFunction, not the Cloudwatch Log group approach which doesn't fetch anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Returns:
state_machine_tags (List[str]): the list of "key:value" Datadog tag strings
"""
if self._is_expired():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious. So, for a specific execution_arn, its tags are keeping changing? What kind of change can it have? Or once its tags are not null, there will be no change for these tags.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tags are cached for 300 seconds. Every 300 seconds, forwarder will refetch all tag for state machines. Does this answer your question?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kimi-p My concern is not about TTL. I am just curious about what kind of change can happen for these cached tags. An expired item in the cache can still be usable if the item has not changed at all. For a state machine, its tags can change over time for each execution, while for a specific execution of a state machine (execution_arn), I am wondering what kind of change can happen to its tags.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tags are actually on the state machines. So unless tags on these state machines are changing, SF logs' tags won't change. The TTL of 300 seconds is to make sure that the tags are somewhat fresh. I'm not sure if I answered your questions, we can talk about it in the standup.

Copy link
Contributor

@nine5two7 nine5two7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kimi-p
Copy link
Contributor Author

kimi-p commented Nov 18, 2022

Latest changes are:

  • Breaks cache.py into 4 files (base, lambda, cloudwatch log group, step functions)
  • Update .dockerignore to accept all py files sits in the same directory.

@kimi-p
Copy link
Contributor Author

kimi-p commented Nov 21, 2022

I have verified that after the refactor, logs are forwarded and their tags sent correctly.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants